This document reports the details behind the generation of the E. coli feature set. It is based mainly on RegulonDB’s curated data, plus a few other sources detailed hereafter.

E. coli feature set

The following sets are generated:

1. Gene set

A table generated by merging all gene-related information from RegulonDB and Zika, indexed by a consensus bnumber, and containing exhaustive synonyms for genes and their products.

2. TSS set and Promoter set

TSSs are gathered from RegulonDB, HT datasets (Mendoza-Vargas et al., 2009; Kim et al., 2012; Cho et al., 2014; Thomason et al., 2014; Yan et al., 2018), and an unpublished collection from the Wade group. They are merged on the basis of their coordinates and strand.

Promoters are regions containing 1 or more TSSs, where each TSS is at most 4 bp away from another TSS. A promoter opens with a TSS, and expands as long as there is another TSS in the following 4 positions (using a 5-nt window, meaning the maximum “empty” space between them is 3 nt). It closes with the last TSS of the sequence.

Promoter definition

3. Transcription unit set and co-transcribed genes set

Transcription units are defined by their unique coordinates and strand. Experimental TUs are extracted from RegulonDB, HT TUs from a public dataset (Yan et al., 2018), and custom orphan TUs are made of orphan genes. They are then merged on the basis of their coordinates and strand.

Every group of co-transcribed genes (CTG) is made of genes that are co-transcribed together at least once. The CTG set is derived from the TU set. TUs that contain exactly the same complete genes are grouped into a CTG set, and the widest coordinates are kept. A given gene can be in several CTG sets, but 2 CTG sets cannot contain exactly the same genes.

Both sets come with an “operon_name” column. Here, an operon is a set of adjacent genes made of “one or several mutually overlapping transcription units that are transcribed in the same direction and share at least one gene”, as proposed by Mejía-Almonte et al. (2020). It is purely informative and may not match known operons.

TU and CTG definition

4. Binding sites set

Set of curated binding sites from RegulonDB. They are merged using their coordinates, TF bnumber and effect (+ or -). When the TF is a heterodimer, 2 entries are created: one per bnumber.
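The merging rule above can be illustrated with a minimal Python sketch. The record layout (`TF_bnumbers` as a list) and the placeholder identifiers `tfA` and `tfB` are assumptions made for illustration, not the actual RegulonDB fields.

```python
def expand_and_merge_sites(sites):
    """Create one entry per TF bnumber, then drop exact duplicates.

    A site is identified by (start, stop, TF bnumber, effect).
    The 'TF_bnumbers' list is a hypothetical field: it holds one
    bnumber for a homomeric TF and two for a heterodimer.
    """
    seen = set()
    merged = []
    for site in sites:
        for bnumber in site["TF_bnumbers"]:
            key = (site["start"], site["stop"], bnumber, site["effect"])
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Illustrative records only: 'tfA' and 'tfB' are placeholder identifiers.
sites = [
    {"start": 10, "stop": 30, "TF_bnumbers": ["tfA", "tfB"], "effect": "+"},
    {"start": 10, "stop": 30, "TF_bnumbers": ["tfA"], "effect": "+"},
    {"start": 10, "stop": 30, "TF_bnumbers": ["tfA"], "effect": "-"},
]
tfbs_entries = expand_and_merge_sites(sites)
```

In this example the heterodimer site yields two entries, the second record is absorbed as a duplicate, and the third is kept because its effect differs.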

1 Gene set

Master gene table, with exhaustive synonyms gathered from RegulonDB, Ecocyc, Zika, etc.

2 TSSs and promoters

The following files are created:

Feature_set_2021-09-21/tss_set_2021-09-21.tsv

  • tss_id: custom id (CTSSxxx)
  • start: +1
  • stop: +1
  • strand: “+” or “-”
  • name: depending on source
  • source: datasets from where it was collected
  • condition: depending on source
  • HT: boolean, 1 if appears in HT TSSs at least once, else 0
  • classic: boolean, 1 if appears in classic TSSs at least once, else 0
  • prediction: boolean, 1 if appears in prediction TSSs at least once, else 0
  • orientation: depending on source
  • type: “TSS”
  • alt_id: id from source

Feature_set_2021-09-21/promoter_set_2021-09-21.tsv

  • promoter_id: custom id (CPROMxxx)
  • start: left TSS +1 position
  • stop: right TSS +1 position
  • strand: “+” or “-”
  • type: “Promoter”

Feature_set_2021-09-21/tss_promoter_map_2021-09-21.tsv

  • promoter_id: custom id (CPROMxxx)
  • tss_id: custom id (CTSSxxx)

2.1 Sources

2.1.1 HT TSS

2.1.1.1 Wade data

  • Get Wade TSSs from Zika database

2.1.1.2 Morett data

  • Files were downloaded on 2021/03/19 here (Reference)
  • Genome coordinates were converted from version U00096.2 to version U00096.3 here
  • This dataset is composed of 3 files formatted differently:
    • One file has TSS left and right positions, unique position is selected depending on strand
    • Two files have associated gene’s left and right coordinates, and relative position of the TSS to the gene (unique position is calculated depending on strand). These files don’t have relative orientation information

2.1.1.3 Storz data

  • Files were downloaded on 2021/03/19 here (Reference)
  • Genome coordinates were converted from version U00096.2 to version U00096.3 here

2.1.1.4 Yan data

  • HT-inferred TSSs from PacBio long read data.
  • Get PacBio TUs from the Yan paper (reference)
  • Files downloaded from here

2.1.1.5 Palsson data

  • 2014 paper
  • Genome coordinates were converted from version U00096.2 to version U00096.3 here

2.1.2 Predictions (Huerta et al., 2003)

This set is not included

2.1.3 Classic experiments

2.1.3.1 RegulonDB

  • Built from the PromoterSet.txt file downloaded from RegulonDB website (version 10.8)
  • HT data is removed (to be used as an independent set)
  • Predicted TSSs are removed
  • Weak-evidence TSSs are removed

2.2 TSS set

  • TSSs are merged into single entries based on “start_stop_strand” duplicates.
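A minimal sketch of this merging step, in plain Python. The input record layout and the `kind` field (used to set the HT and classic flags) are assumptions for illustration; the actual pipeline may be organised differently.

```python
from collections import defaultdict


def merge_tss(records):
    """Collapse TSS records that share the same (start, stop, strand) key."""
    merged = defaultdict(lambda: {"sources": set(), "HT": 0, "classic": 0})
    for rec in records:
        key = (rec["start"], rec["stop"], rec["strand"])
        entry = merged[key]
        entry["sources"].add(rec["source"])
        # 'kind' is a hypothetical field holding either "HT" or "classic".
        entry[rec["kind"]] = 1
    return dict(merged)


# Illustrative records, not taken from the real datasets.
records = [
    {"start": 100, "stop": 100, "strand": "+", "source": "RegulonDB", "kind": "classic"},
    {"start": 100, "stop": 100, "strand": "+", "source": "Yan", "kind": "HT"},
    {"start": 100, "stop": 100, "strand": "-", "source": "Yan", "kind": "HT"},
]
tss_set = merge_tss(records)
```

The two "+" records collapse into a single entry flagged as both HT and classic, while the "-" record stays separate because the strand is part of the key.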

2.3 Promoter set

Promoters are defined by merging TSSs with a 5-bp sliding window.

NB: TSSs without coordinates are removed.
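The window rule can be sketched as follows, assuming it is applied separately to each strand and that a TSS joins the current promoter when it lies at most 4 bp from the previous TSS. The function name and the `max_gap` parameter are illustrative.

```python
def build_promoters(tss_positions, max_gap=4):
    """Group the TSS positions of one strand into promoters.

    A promoter opens with a TSS and keeps extending while the next
    TSS is at most `max_gap` bp away from the previous one
    (a 5-nt window, i.e. at most 3 empty positions in between).
    Returns (start, stop, member_positions) tuples.
    """
    groups = []
    for pos in sorted(set(tss_positions)):
        if groups and pos - groups[-1][-1] <= max_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return [(g[0], g[-1], g) for g in groups]


# Illustrative positions on a single strand.
promoters = build_promoters([100, 104, 108, 113, 200])
```

Here 100, 104 and 108 chain into one promoter spanning 100 to 108, whereas 113 sits 5 bp after 108 and therefore opens a new promoter.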

Promoter definition

2.4 TSS-promoter mapping

2.5 Summary

TSSs without filtering: 65409

TSSs after duplicate merging: 28987

Promoters: 23316

2.5.1 TSSs stats

2.5.1.1 TSSs merged by coordinates between sources

2.5.1.2 TSSs merged by coordinates within sources

2.5.2 Promoter stats

Promoters were built using a 5-bp sliding window to group nearby TSSs.

  • Average number of TSSs per promoter: 1.24
  • Average promoter size (nt): 1.4

2.5.2.1 TSSs merged by sliding window between sources

2.5.2.2 TSSs merged by sliding window within sources

3 Transcription units and co-transcribed genes

Transcription units are defined by their unique coordinates and strand. Experimental TUs are extracted from RegulonDB, HT TUs from a public dataset (Yan et al., 2018), and custom orphan TUs are made for the remaining “orphan genes”, that is, genes that are not entirely covered by any TU. They are then merged on the basis of their coordinates and strand.

Every group of co-transcribed genes (CTG) is made of genes that are entirely co-transcribed together at least once. The CTG set is derived from the TU set. TUs that contain exactly the same complete genes are grouped into a CTG set, and the widest coordinates are kept. A given gene can be in several CTG sets, but 2 CTG sets cannot contain exactly the same genes. Every gene from Zika’s genesView is present in at least one CTG set.

Both sets come with an “operon_name” column. Here, an operon is a set of adjacent genes made of “one or several mutually overlapping transcription units that are transcribed in the same direction and share at least one gene”, as proposed by Mejía-Almonte et al. (2020). It is purely informative and may not match known operons.

TU and CTG definition

Notes:

  • Terminators have left and right positions, so the “largest” end is kept
  • Operon names are made from overlapping CTGs that share at least one gene; they may differ from RegulonDB/Ecocyc operons.
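One way to picture the operon grouping is as connected components: units on the same strand that share a gene, directly or through a chain of other units, end up in the same operon. The sketch below uses a small union-find structure; the input layout and all identifiers are hypothetical, and it is not necessarily how the operon_name column was actually computed.

```python
def group_operons(units):
    """Group units (TUs or CTGs) that share at least one gene on the same strand.

    `units` maps a unit id to (strand, set_of_genes).
    Returns a list of sets of unit ids, one set per operon.
    """
    parent = {uid: uid for uid in units}

    def find(uid):
        # Follow parent links to the group representative (with path halving).
        while parent[uid] != uid:
            parent[uid] = parent[parent[uid]]
            uid = parent[uid]
        return uid

    first_owner = {}  # (strand, gene) -> first unit seen carrying that gene
    for uid, (strand, genes) in units.items():
        for gene in genes:
            key = (strand, gene)
            if key in first_owner:
                parent[find(uid)] = find(first_owner[key])
            else:
                first_owner[key] = uid

    groups = {}
    for uid in units:
        groups.setdefault(find(uid), set()).add(uid)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Placeholder unit and gene names, for illustration only.
units = {
    "CCTG1": ("+", {"geneA", "geneB"}),
    "CCTG2": ("+", {"geneB", "geneC"}),
    "CCTG3": ("+", {"geneC"}),
    "CCTG4": ("-", {"geneD"}),
}
operons = group_operons(units)
```

CCTG1 and CCTG3 share no gene directly but are linked through CCTG2, so all three fall into the same operon.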

The following files and fields are created:

Feature_set_2021-09-21/tu_set_2021-09-21.tsv

  • tu_id: custom id (CTUxxx)
  • start: left coordinate
  • stop: right coordinate
  • strand: “+” or “-”
  • source: datasets from where it was collected
  • tu_name: “valid” genes in tu (see definitions)
  • operon_name: “valid” genes in operon (see definitions)
  • init_type: one of “TSS”, “HT TSS”, “gene”
  • term_type: one of “TTS”, “HT TTS”, “gene”, “long”
  • reported_bnumbers: TU genes reported by the source
  • valid_bnumbers: TU genes that are entirely contained by the TU coordinates and IDed in Zika
  • flag_genes: boolean, 1 if valid_bnumbers != reported_bnumbers, else 0
  • classic: boolean, 1 if appears in classic TUs at least once, else 0
  • HT: boolean, 1 if appears in HT TUs at least once, else 0
  • orphan: boolean, 1 if TU is made of a gene that was not included in any of the previous TUs’ valid bnumbers
  • type: “TU”
  • alt_id: id from source

Feature_set_2021-09-21/ctg_set_2021-09-21.tsv

  • ctg_id: custom id (CCTGxxx)
  • start: left coordinate (widest)
  • stop: right coordinate (widest)
  • strand: “+” or “-”
  • reported_bnumbers: TU genes reported by the source
  • valid_bnumbers: TU genes that are entirely contained by the TU coordinates and IDed in Zika
  • type: “CTG”

Feature_set_2021-09-21/ctg_tu_map_2021-09-21.tsv

  • ctg_id: custom id (CCTGxxx)
  • tu_id: custom id (CTUxxx)

Feature_set_2021-09-21/ctg_gene_map_2021-09-21.tsv

  • ctg_id: custom id (CCTGxxx)
  • valid_bnumbers: gene’s “consensus bnumber” from the master gene table
  • rank: position in CTG unit (strand-wise, excluding non-valid genes)
  • Zika_gene_id: gene ID from Zika

3.1 Sources

3.1.1 RegulonDB

  • Get all TUs from RegulonDB by directly querying the database, with their associated promoter (if any), and first and last gene positions

  • Get terminators associated to TUs

NB: terminator objects are not mapped to TUs here; only their position is used as a TU end coordinate

  • Merge experimental TUs, promoters, terminator and gene coordinates

    • get the promoter position as start position if available, otherwise first position of the first gene in the TU (add flags)
    • get the terminator position as end position if available, otherwise last position of the last gene in the TU (add flags)
    • it is worth noting that some TUs (not operons) are associated with distinct terminators, which causes some redundancy
    • a few genes that are associated to a TU in RegulonDB don’t have coordinates, so the master gene file is used to get them
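The fallback logic for TU boundaries can be sketched as below. This is a hedged illustration: the function signature is hypothetical, and it assumes that the “largest” terminator end means the end that yields the widest TU on the given strand.

```python
def tu_boundaries(strand, genes, promoter_pos=None, terminator=None):
    """Return (start, stop, init_type, term_type) with start <= stop.

    genes: list of (left, right) gene coordinates in the TU.
    promoter_pos: TSS position of the associated promoter, if any.
    terminator: (left, right) coordinates of the terminator, if any.
    """
    gene_left = min(g[0] for g in genes)
    gene_right = max(g[1] for g in genes)
    if strand == "+":
        five_prime = promoter_pos if promoter_pos is not None else gene_left
        # Assumption: keep the terminator end that gives the widest TU.
        three_prime = max(terminator) if terminator else gene_right
    else:
        five_prime = promoter_pos if promoter_pos is not None else gene_right
        three_prime = min(terminator) if terminator else gene_left
    init_type = "TSS" if promoter_pos is not None else "gene"
    term_type = "TTS" if terminator else "gene"
    start, stop = sorted((five_prime, three_prime))
    return start, stop, init_type, term_type


# Illustrative coordinates only.
genes = [(100, 200), (210, 300)]
with_promoter = tu_boundaries("+", genes, promoter_pos=80)
with_terminator = tu_boundaries("-", genes, terminator=(60, 90))
```

The returned init_type and term_type play the role of the flags mentioned above, recording whether each end came from a promoter or terminator or fell back to a gene coordinate.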

3.1.2 PacBio (HT)

  • Get PacBio TUs from the Yan paper (reference)

  • Files provided by Victor: link

  • A few bnumbers have to be updated to new ones (using master gene file):

    • Old: b0255,b0257,b1016,b1017,b1416,b1417,b1509,b1510,b2031,b2090,b2138,b2999,b3000,b3767,b3768,b4540
    • New: b2139,b4488,b4490,b4493,b4498,b4571,b4587,b4658,b4696
  • This may artificially change the number and order of genes in those TUs. For example, the TU “b1417,b1416” becomes “b4493”.

  • Custom IDs are created for PacBio TUs as follows: PB_GC_TUdefinition_XXX

    • GC: Growth condition (M9 or Rich)
    • TUdefinition: method to get TU boundaries: TU is defined by either a defined TSS and TTS pair (definedEnd) or the longest read from a defined TSS (longestRead).
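The bnumber update mentioned in this list can be sketched as a lookup followed by order-preserving deduplication. Only the b1417/b1416 to b4493 pair is taken from the example above; the function name and the dictionary layout are assumptions.

```python
def update_bnumbers(tu_genes, old_to_new):
    """Replace outdated bnumbers and drop repeats, keeping gene order."""
    updated = []
    for bnumber in tu_genes:
        current = old_to_new.get(bnumber, bnumber)
        if current not in updated:
            updated.append(current)
    return updated


# Partial mapping, inferred from the single example given in the text.
old_to_new = {"b1416": "b4493", "b1417": "b4493"}
updated_tu = update_bnumbers(["b1417", "b1416"], old_to_new)
```

Because both old identifiers map to the same new one, the two-gene TU collapses to a single gene, which is the change in gene number described above.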

3.1.3 Orphan TUs

  • Get valid genes contained in TUs

    • Genes that are entirely contained by the TU boundaries
    • Genes that have a gene id in Zika (this excludes most phantom and pseudo genes, and a few small RNA-coding genes)
  • Get genes that are not present and valid in a TU from RegulonDB or PacBio and make them “orphan TUs”

  • Alternative IDs are created for orphan TUs as follows: orphan_XXX
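A minimal sketch of the valid-gene and orphan-gene logic follows. It assumes a gene table keyed by gene identifier that stands in for the genes having an id in Zika; the field names and placeholder gene names are hypothetical.

```python
def valid_bnumbers(tu, gene_table):
    """Reported genes that are in the gene table and lie entirely within the TU."""
    valid = []
    for bnumber in tu["reported_bnumbers"]:
        if bnumber not in gene_table:  # no gene id in the reference table
            continue
        left, right = gene_table[bnumber]
        if tu["start"] <= left and right <= tu["stop"]:
            valid.append(bnumber)
    return valid


def orphan_genes(tus, gene_table):
    """Genes of the table that are not valid in any TU."""
    covered = {b for tu in tus for b in valid_bnumbers(tu, gene_table)}
    return [b for b in gene_table if b not in covered]


# Placeholder genes and coordinates, for illustration only.
gene_table = {"geneA": (100, 200), "geneB": (250, 400), "geneC": (500, 600)}
tus = [{"start": 90, "stop": 300, "reported_bnumbers": ["geneA", "geneB", "geneX"]}]

valid = valid_bnumbers(tus[0], gene_table)
orphans = orphan_genes(tus, gene_table)
# flag_genes is 1 when the valid genes differ from the reported ones.
flag_genes = int(valid != tus[0]["reported_bnumbers"])
```

In this example geneB is only partly covered and geneX is unknown to the table, so only geneA is valid; geneB and geneC would each become an orphan TU.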

3.2 TU set creation

  • TUs are merged into single entries based on “start_stop_strand” duplicates

3.3 Co-transcribed genes set

  • TUs are merged into single entries based on valid gene content duplicates

  • A table is created to map TUs with CTGs
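The CTG derivation can be sketched as a grouping of TUs by strand and valid gene content, keeping the widest coordinates and recording which TUs feed each CTG. The record layout is assumed for illustration, and the handling of TUs without any valid gene is left out.

```python
def build_ctgs(tus):
    """Group TUs with identical valid gene content into CTGs.

    Returns a dict keyed by (strand, frozenset_of_genes); each value
    holds the widest coordinates and the ids of the contributing TUs.
    """
    ctgs = {}
    for tu in tus:
        key = (tu["strand"], frozenset(tu["valid_bnumbers"]))
        if key not in ctgs:
            ctgs[key] = {"start": tu["start"], "stop": tu["stop"], "tu_ids": []}
        ctg = ctgs[key]
        ctg["start"] = min(ctg["start"], tu["start"])  # widest left coordinate
        ctg["stop"] = max(ctg["stop"], tu["stop"])     # widest right coordinate
        ctg["tu_ids"].append(tu["tu_id"])
    return ctgs


# Placeholder TUs, for illustration only.
tus = [
    {"tu_id": "CTU1", "start": 90, "stop": 400, "strand": "+", "valid_bnumbers": ["geneA", "geneB"]},
    {"tu_id": "CTU2", "start": 80, "stop": 390, "strand": "+", "valid_bnumbers": ["geneA", "geneB"]},
    {"tu_id": "CTU3", "start": 90, "stop": 210, "strand": "+", "valid_bnumbers": ["geneA"]},
]
ctgs = build_ctgs(tus)
```

The `tu_ids` list of each group provides the content of the CTG-to-TU mapping table.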

3.4 Summary

  • Total TUs: 8511

  • Total TUs without duplicate coordinates: 8221

  • Total TUs without duplicate gene content (CTG set): 4283

3.4.1 All TUs

3.4.1.1 Sources

3.4.1.2 Type of initiation and termination of TUs

Based on all collected TUs before any sort of merging.

3.4.2 TU set

3.4.2.1 Coordinates duplication between datasets

3.4.2.2 Coordinates duplication within datasets

NB: y axis is the number of TUs duplicated, x axis is the duplication factor

3.4.3 CTG set

3.4.3.1 Gene content duplication between datasets

3.4.3.2 Gene content duplication within datasets

NB: y axis is the number of TUs duplicated, x axis is the duplication factor

4 Known binding sites

Based on RegulonDB version 10.8, downloaded here.

Weak-evidence sites are removed.

The following files and fields are created:

Feature_set_2021-09-21/tfbs_set_2021-09-21.tsv

  • tfbs_id: custom id (CTFBSXXX)
  • start: left position
  • stop: right position
  • TF_name: TF protein name
  • TF_bnumber: consensus bnumber from the master table of the gene(s) coding the TF
  • effect: repressor or activator
  • evidence: type of evidence, experiment, prediction, etc
  • confidence: depending on evidence, one of “Weak”, “Strong” or “Confirmed”
  • type: “TFBS”
  • alt_id: id from source
  • Zika_gene_id: gene id from Zika of the gene(s) coding the TF

5 Feature mapping

The following file is created:

Feature_set_2021-09-21/feature_map_2021-09-21.tsv

  • tss_id: custom id (CTSSXXX)
  • tu_id: custom id (CTUXXX)
  • tfbs_id: custom id (CTFBSXXX)